Kira Loshin
In mid-2020, as the Coronavirus pandemic continued to weigh on the global economy, the United States created the Paycheck Protection Program (PPP) to help small businesses keep their workers on payroll. This program, administered by the Small Business Administration (SBA), allowed small businesses (fewer than 500 employees) to apply for a low-interest private loan equal to about 10 weeks of pay. If the business stayed open and continued to pay its employees, the loan was then forgiven by the government. The PPP had two rounds of loans: one in June 2020 and one in January 2021.
The loan amount was calculated as follows: $$L = \dfrac{\text{yearly salary}}{12} \times 2.5$$
The way this funding worked was that first a small business would calculate its payroll costs. For example, a small business comprising a single employee making \$100,000 a year would apply for a loan totalling \$20,833. The business would apply for this loan through a lender (known as the originating lender). If the loan was approved, the SBA would release the funds to the lender, who would then disburse the loan to the applicant.
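The calculation above can be sketched in a few lines of Python (a minimal sketch; the function name is ours, not part of any SBA tooling):

```python
def ppp_loan_amount(yearly_salary: float) -> float:
    """Approximate PPP loan amount: 2.5x the average monthly payroll."""
    monthly_payroll = yearly_salary / 12
    return monthly_payroll * 2.5

# A single employee making $100,000/year qualifies for about $20,833
print(round(ppp_loan_amount(100_000), 2))  # → 20833.33
```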
However, many small businesses have not had their approved loan funds disbursed. Can we find any reasons why some businesses received their funds while others have not? Can we attribute this to something specific to a small business, or are there larger forces at play with the originating lenders? Initial analysis of this dataset uncovered some interesting signs of potential fraud, both within the small businesses and within the banks processing their loans.
This analysis will look at individuals who applied for a PPP loan.
import numpy as np
import sklearn.datasets
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError
import requests
import folium
from folium.plugins import TimestampedGeoJson
import pgeocode
import statsmodels.formula.api
import statsmodels.api as sm
import statistics as stats
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
import warnings
We will be using several Python packages to complete this analysis. Some of the more important ones are Pandas, Seaborn, BeautifulSoup, Folium, and scikit-learn. Each package name links to that library's documentation, which is a great resource to learn more about the libraries used here.
pd.set_option('display.max_columns', None)
warnings.filterwarnings('ignore')
Data from the PPP can be found here on the SBA website. This dataset comprises several million rows. To make it more feasible for analysis, without loss of generality I decided to limit my analysis to loans from small businesses in the state of Illinois. Illinois was selected for its relatively high ranking for business, its population size, and its political lean. This reduced the number of loans to about 650,000, a more manageable amount for this analysis.
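For reference, a state subset like this can be carved out of the full national CSV without loading it all into memory by using pandas' chunked reader. This is a sketch; the source/destination file names are placeholders, and the `BorrowerState` column name is assumed from this Illinois extract:

```python
import pandas as pd

def filter_state(src: str, dest: str, state: str = "IL",
                 col: str = "BorrowerState", chunksize: int = 100_000) -> None:
    """Stream a large loan CSV and keep only the rows for one state."""
    first = True
    for chunk in pd.read_csv(src, chunksize=chunksize, dtype=str):
        subset = chunk[chunk[col] == state]
        # Write the header only on the first chunk, then append
        subset.to_csv(dest, mode="w" if first else "a",
                      header=first, index=False)
        first = False
```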
The attributes of this dataset can be understood as follows:
loans = pd.read_csv('il_loans.csv')
loans.head()
The loan amount requested was calculated by determining the 10-week salary amount for the business. These values can be verified by comparing the salaries for each loan's profession to the published mean salaries. This data is published by the US Bureau of Labor Statistics and can be found here.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
try:
    res = requests.get("https://www.bls.gov/oes/current/oes_nat.htm", headers=headers)
    res.raise_for_status()  # surface HTTP errors (e.g. 403/404) as exceptions
except requests.exceptions.RequestException as e:
    print(e)
bs = BeautifulSoup(res.text, 'html.parser')
tab = bs.findAll("table")
df = pd.read_html(str(tab[1]))
merged = df[0]
merged = merged[1:] # Remove the first row, since it is only NaN
merged.head()
In order to compare occupations in the PPP data to occupations in the salary data, a mapping must be made between the two datasets. This can be done by using the NAICSCode column in the PPP dataset and the Occupation Title column in the salaries dataset. For this I used an Excel mapping found on the Census Bureau website here. It will be cleaned and translated for use later on.
code = pd.read_excel('6-digit_2017_Codes.xlsx')
code.head()
Now, let's begin our data processing by cleaning up the datasets. Each dataset will need to be separately modified before it can be pieced together into a complete dataset. There are a few general things that will be done during data cleaning and processing: each dataset will have extraneous (non-relevant) columns deleted, which will make later analysis run much more quickly; we'll also transform any columns containing dates into datetime columns, cast data types when appropriate, and split strings as needed.
To clean the PPP dataset, we want to drop columns we won't need for analysis. The primary columns we will be looking at are: BorrowerZip, InitialLoanAmount, NAICSCode
loans.drop(columns=['LoanNumber', 'SBAOfficeCode', 'ProcessingMethod', 'BorrowerName', 'BorrowerAddress', 'BorrowerCity', 'BorrowerState', 'Term', 'SBAGuarantyPercentage', 'FranchiseName', 'ServicingLenderLocationID', 'ServicingLenderAddress', 'ServicingLenderCity', 'ServicingLenderState', 'ServicingLenderZip', 'ProjectCity', 'ProjectCountyName', 'ProjectState', 'ProjectZip', 'CD', 'UTILITIES_PROCEED', 'PAYROLL_PROCEED', 'MORTGAGE_INTEREST_PROCEED', 'RENT_PROCEED', 'REFINANCE_EIDL_PROCEED', 'HEALTH_CARE_PROCEED', 'DEBT_INTEREST_PROCEED', 'OriginatingLenderLocationID', 'OriginatingLenderCity', 'OriginatingLenderState'], inplace=True)
# Convert each of the date columns to a datetime type
for col in ["DateApproved", "ForgivenessDate", "LoanStatusDate"]:
    dates = pd.to_datetime(loans[col])
    loans = loans.drop(columns=[col])
    loans[col] = dates
For the purpose of our analysis, we will only be looking at loans where the 'JobsReported' was 1. This is to specifically look out for individuals who may be taking advantage of the PPP loan or who may have been taken advantage of. It also will provide us with a direct metric for measuring proposed salaries against the reported average salaries.
single = loans[loans['JobsReported'] == 1]
Later on in our analysis, we will be mapping zip codes. The package we will be using to complete this analysis assumes each zip code is five digits. We transform the data here to meet that requirement.
single['Zipcode'] = single['BorrowerZip'].str[:5]
single.drop(columns=['BorrowerZip'], inplace=True)
The only relevant information for this dataset is the mean salary for each occupation. Here we will drop the unnecessary columns, create our salary column by removing unnecessary strings, and rename our columns to better reflect the data each column contains.
merged["Annual mean wage"] = merged["Annual mean wage"].replace(r'[\$,]', '', regex=True)
merged["Annual mean wage"] = merged["Annual mean wage"].replace('(4)', np.nan)
merged["Annual mean wage"] = merged["Annual mean wage"].astype(float)
merged.drop(columns=['Level','Employment', 'Employment RSE', 'Employment per 1,000 jobs', 'Median hourly wage', 'Mean hourly wage', 'Mean wage RSE'], inplace=True)
avg_sals = merged.rename(columns={"Occupation title (click on the occupation title to view its profile)": "Title", "Annual mean wage": "Salary"})
The NAICS dataset contains the mapping between occupations and NAICS codes. There are some formatting issues that need to be adjusted by shifting the data, and the columns need to be renamed to better match the columns from the other datasets. This will make merging the datasets more straightforward later on.
code = code[1:] # Remove the first row, since it is only NaN
code = code.iloc[: , :-1] # drop last column
code['2017 NAICS Code'] = code['2017 NAICS Code'].astype(int)
code = code.rename(columns={"2017 NAICS Code": "NAICSCode", "2017 NAICS Title":"Title"})
Now that our datasets are clean, let's merge them together. Merging the single loans dataset and the NAICS code dataset provides us with a reference to the salary dataset that we will utilize later.
vals = pd.merge(single, code, on="NAICSCode")
vals.head()
Now that we have our clean dataset, it's time to move on to the next part of the data science pipeline: Exploratory Data Analysis. Using our dataset, we'll apply some data visualization techniques to discover trends in the data, asking questions such as: Why do some loans remain undisbursed? Do the originating lender, the borrower's occupation, demographics, or location have any bearing on whether a loan is disbursed?
Once we answer some of these initial questions, we can try to pull out specific factors which may have an effect on the outcomes we are checking. Later on, we will test whether these factors have an actual effect on the data (or whether they are simply a coincidence).
vals[vals['UndisbursedAmount'] > 0].count()
From here we can see that nearly 42,000 of the loans are still undisbursed! Let's take a closer look at what's going on here.
First, let's look at how many loans are not yet disbursed. We'll filter our data to only reflect loans which have not been disbursed. Remember that an undisbursed loan means the government has paid the lender the loan amount; the lender just hasn't passed that money on to the borrower. Since the money is currently with the lender, let's first see if there is anything we can learn about the lenders themselves.
# Add a new column, percent undisbursed, which is a fraction of how much of a loan is still undisbursed
vals['PctUndisbursed'] = vals['UndisbursedAmount'] / vals['InitialApprovalAmount']
# Only keep rows where there is still undisbursed funds
vals_undisbursed = vals.drop(vals[vals['UndisbursedAmount'] == 0].index)
sns.set(rc = {'figure.figsize':(15,8)})
sns.countplot(x="OriginatingLender",data=vals_undisbursed).set_title("Undisbursed Loans: Originating Lender")
Looking at this count, there appears to be quite a few outliers! Let's see if we can get a closer look at which lenders these are. We'll do this by filtering our dataset to only show us lenders who have > 1000 undisbursed loans.
lender_undisbursed = vals_undisbursed[vals_undisbursed.groupby('OriginatingLender').OriginatingLender.transform('count')>1000].copy()
sns.set(rc = {'figure.figsize':(15,8)})
sns.countplot(x="OriginatingLender",data=lender_undisbursed, order = lender_undisbursed['OriginatingLender'].value_counts().index).set_title("Undisbursed Loans: Originating Lender, top offenders")
We can clearly see seven lenders which are holding onto undisbursed funds: Capital Plus Financial, LLC; BSD Capital, LLC; Harvest Small Business Finance, LLC; Prestamos CDFI, LLC; Benworth Capital; Customers Bank; and Leader Bank, National Association. We'll hang onto these lenders and take a closer look at them later on in the tutorial.
Next, let's see whether the type of job an individual has affects whether the loan remains undisbursed. We will once again plot the count and filter based on outliers.
sns.set(rc = {'figure.figsize':(15,8)})
sns.countplot(x="NAICSCode",data=vals_undisbursed).set_title("Undisbursed Loans: NAICS Codes")
vals_sort = vals_undisbursed[vals_undisbursed.groupby('NAICSCode').NAICSCode.transform('count')>500].copy()
sns.set(rc = {'figure.figsize':(15,8)})
sns.countplot(x="NAICSCode",data=vals_sort, order = vals_sort['NAICSCode'].value_counts().index).set_title("Undisbursed Loans: NAICS Codes, top codes")
sns.set(rc = {'figure.figsize':(15,8)})
sns.violinplot(x="NAICSCode", y="PctUndisbursed", data=vals_sort).set_title("Violin Plot of Undisbursed Loans: NAICS Codes, top codes")
There once again seem to be several outliers here. Looking at a violin plot of the data, we can see that there is little real variation among these loans: nearly every one is completely undisbursed. Despite a little apparent variation for a couple of the codes, the white dot at 1.0 in the center of each plot (and the relative flatness of each plot) indicates that these loans can be considered completely undisbursed as well. Let's take a closer look at these NAICS codes. What are these jobs? Is there something unusual here that may be preventing the loans from being disbursed?
We'll begin this analysis by matching these jobs with the corresponding salary information from the salary dataset. Since there is no direct connection between the NAICS codes and the salary dataset's occupation titles, we'll manually match each occupation to the closest corresponding title.
order = vals_sort['Title'].value_counts().index
vals_comp = vals_sort.copy()
# Let's look at what these outlier salaries are:
print(order)
# Matching each of these salaries to a corresponding value
a = avg_sals.loc[avg_sals['Title'] == 'Barbers, Hairdressers, Hairstylists and Cosmetologists']
b = avg_sals.loc[avg_sals['Title'] == 'Barbers']
c = avg_sals.loc[avg_sals['Title'] == 'Passenger Vehicle Drivers, Except Bus Drivers, Transit and Intercity']
d = avg_sals.loc[avg_sals['Title'] == 'Food Service Managers']
e = avg_sals.loc[avg_sals['Title'] == 'Construction Laborers']
f = avg_sals.loc[avg_sals['Title'] == 'Retail Sales Workers']
g = avg_sals.loc[avg_sals['Title'] == 'Personal Care and Service Occupations']
h = avg_sals.loc[avg_sals['Title'] == 'Janitors and Cleaners, Except Maids and Housekeeping Cleaners']
i = avg_sals.loc[avg_sals['Title'] == 'Landscaping and Groundskeeping Workers']
j = avg_sals.loc[avg_sals['Title'] == 'Office and Administrative Support Workers, All Other']
k = avg_sals.loc[avg_sals['Title'] == 'Childcare Workers']
l = avg_sals.loc[avg_sals['Title'] == 'Business Operations Specialists']
m = avg_sals.loc[avg_sals['Title'] == 'Securities, Commodities, and Financial Services Sales Agents']
n = avg_sals.loc[avg_sals['Title'] == 'Artists and Related Workers, All Other']
o = avg_sals.loc[avg_sals['Title'] == 'Food Preparation Workers']
p = avg_sals.loc[avg_sals['Title'] == 'Heavy and Tractor-Trailer Truck Drivers']
q = avg_sals.loc[avg_sals['Title'] == 'Home Health and Personal Care Aides']
r = avg_sals.loc[avg_sals['Title'] == 'Driver/Sales Workers and Truck Drivers']
rows = [a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r]
t = pd.concat(rows)
toComp = ['Beauty Salons ', 'Barber Shops ', 'Taxi Service ', 'Caterers',
'Residential Remodelers ',
'All Other Miscellaneous Store Retailers (except Tobacco Stores) ',
'All Other Personal Services ', 'Janitorial Services ',
'Landscaping Services', 'All Other Business Support Services ',
'Child Day Care Services ', 'Business Associations ',
'Sales Financing ', 'Independent Artists, Writers, and Performers ',
'Food Service Contractors', 'General Freight Trucking, Local ',
'Home Health Care Services', 'Couriers and Express Delivery Services']
# Matching the column information
t['titleToComp'] = toComp
Now we'll merge our new dataset. We'll do this on our initial dataset (with both disbursed and undisbursed values) to see how these jobs compare to the larger dataset. We'll also introduce two new variables: RemainingUndisbursed, a flag marking whether any funds remain undisbursed, and SalaryExp, the annual salary implied by the requested loan amount.
sal_to_comp = pd.merge(vals, t, left_on='Title', right_on='titleToComp')
sal_to_comp.head(5)
sal_to_comp['RemainingUndisbursed'] = False
sal_to_comp.loc[sal_to_comp['UndisbursedAmount'] != 0, 'RemainingUndisbursed'] = True
sal_to_comp['SalaryExp'] = 12 * (sal_to_comp['InitialApprovalAmount']/2.5)
Let's pull out the top 6 NAICS codes to get a closer look. For each plot, we'll add a horizontal line at the average salary of the profession and a hue indicating whether there are remaining undisbursed funds.
salons = sal_to_comp.loc[sal_to_comp['NAICSCode'] == 812112]
barbers = sal_to_comp.loc[sal_to_comp['NAICSCode'] == 812111]
drivers = sal_to_comp.loc[sal_to_comp['NAICSCode'] == 485310]
caterers = sal_to_comp.loc[sal_to_comp['NAICSCode'] == 722320]
remodelers = sal_to_comp.loc[sal_to_comp['NAICSCode'] == 236118]
retailers = sal_to_comp.loc[sal_to_comp['NAICSCode'] == 453998]
_ , ax = plt.subplots(1,3, figsize=(16,6))
sns.violinplot(x="NAICSCode", y = 'SalaryExp', data = salons, hue='RemainingUndisbursed', ax = ax[0]).set_title("Salon Salaries")
l0=ax[0].axhline(32990,color='black',ls='--')
sns.violinplot(x="NAICSCode", y = 'SalaryExp', data = barbers, hue='RemainingUndisbursed', ax = ax[1]).set_title("Barber Salaries")
l1=ax[1].axhline(38050,color='black',ls='--')
sns.violinplot(x="NAICSCode", y = 'SalaryExp', data = drivers, hue='RemainingUndisbursed', ax = ax[2]).set_title("Taxi Driver Salaries")
l2=ax[2].axhline(34360,color='black',ls='--')
_ , ax = plt.subplots(1,3, figsize=(16,6))
sns.violinplot(x="NAICSCode", y = 'SalaryExp', data = caterers, hue='RemainingUndisbursed', ax = ax[0]).set_title("Caterer Salaries")
l0=ax[0].axhline(61000,color='black',ls='--')
sns.violinplot(x="NAICSCode", y = 'SalaryExp', data = remodelers, hue='RemainingUndisbursed', ax = ax[1]).set_title("Remodeler Salaries")
l1=ax[1].axhline(43000,color='black',ls='--')
sns.violinplot(x="NAICSCode", y = 'SalaryExp', data = retailers, hue='RemainingUndisbursed', ax = ax[2]).set_title("Retail Worker Salaries")
l2=ax[2].axhline(29010,color='black',ls='--')
It's interesting to note here that while there does not seem to be any correlation between the amount requested and whether the funds were disbursed, the salary expectations for each of these jobs are significantly higher than the average. For example, barber shops and retail workers are consistently claiming that their salaries are almost double the average (about \$100,000). In several of these cases, individuals are requesting the equivalent of several hundred thousand dollars (an entire order of magnitude above average). One other interesting point is that the average salary claim behind undisbursed funds for taxi drivers is almost twice that of taxi drivers whose funds were disbursed. We will continue our analysis keeping these ideas in mind.
We'll now look at a few different demographic metrics for our previously determined NAICS codes. These plots should help us determine whether these attributes can later be used for predictive analysis.
sns.set(rc = {'figure.figsize':(15,8)})
sns.countplot(x="NAICSCode",data=vals_sort, order = vals_sort['NAICSCode'].value_counts().index, hue='Gender').set_title("Undisbursed Funds: Gender")
sns.set(rc = {'figure.figsize':(15,8)})
sns.countplot(x="NAICSCode",data=vals_sort, order = vals_sort['NAICSCode'].value_counts().index, hue='Veteran').set_title("Undisbursed Funds: Veteran Status")
sns.set(rc = {'figure.figsize':(15,8)})
sns.countplot(x="NAICSCode",data=vals_sort, order = vals_sort['NAICSCode'].value_counts().index, hue='Race').set_title("Undisbursed Funds: Race")
Looking at this demographic data, we can see that the majority of borrowers did not answer questions related to gender, race, and veteran status. These are all cases of Missing Not At Random data. There are a few different ways of resolving this. One method is to impute the data and test how it affects our results. Another would be to look at the census data for each borrower's zip code and fill in the best approximation of each borrower's demographics. In our case, we will drop these attributes when completing our predictive analysis: since the majority of borrowers left them unanswered, these demographics were presumably not relevant when approving the loans.
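As a quick sanity check of how prevalent the non-answers are, we can compute the share of placeholder values per column. This is a sketch on toy data; it assumes the placeholder string is "Unanswered", as in the PPP demographic columns:

```python
import pandas as pd

def unanswered_share(df: pd.DataFrame, cols, placeholder: str = "Unanswered") -> pd.Series:
    """Fraction of rows holding the placeholder value in each demographic column."""
    return df[cols].apply(lambda s: (s == placeholder).mean())

# Toy example (hypothetical rows, for illustration only)
demo = pd.DataFrame({
    "Gender": ["Unanswered", "Female Owned", "Unanswered", "Unanswered"],
    "Veteran": ["Unanswered", "Non-Veteran", "Veteran", "Unanswered"],
})
print(unanswered_share(demo, ["Gender", "Veteran"]))
# Gender: 0.75, Veteran: 0.50
```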
The last attribute we will analyze in our exploratory data analysis phase is location data. We will see if there is anything we can discern based on the address information for each borrower. To do this, we need to get the latitude and longitude of each address. This process can take an extremely long time to complete, so instead of running the address code more than once, we ran it in a separate window and exported the addresses into a CSV that can be loaded in. The code to build this CSV is provided, but commented out. In this case, we will be looking at a subset of coordinates to limit the size of the file.
This map was built using Folium (referenced above). Each dot on the map represents the amount of the loan that is still undisbursed.
import csv
with open('coords.csv') as f:
    reader = csv.reader(f)
    data = list(reader)
data = data[::2]
data = data[1:10001]
coords_to_test = single[:10000]
coords_to_test['coords'] = data
coords_to_test['PctUndisbursed'] = coords_to_test['UndisbursedAmount'] / coords_to_test['InitialApprovalAmount']
'''
from geopy.geocoders import Nominatim  # assumed geocoder; geolocator was not defined above
geolocator = Nominatim(user_agent="ppp-loan-tutorial")
coords = []
for addr, city, state in zip(single['BorrowerAddress'], single['BorrowerCity'], single['BorrowerState']):
    loc = geolocator.geocode(addr + ',' + city + ',' + state, timeout=None)
    if loc is None:
        # Fall back to the city/state if the full address fails to geocode
        loc = geolocator.geocode(city + ',' + state, timeout=None)
    if loc is None or loc.latitude is None or loc.longitude is None:
        coords.append([100, 100])  # sentinel for rows that couldn't be geocoded
    else:
        coords.append([loc.latitude, loc.longitude])
'''
# setting the starting latitude, longitude and zoom.
loans_map = folium.Map(location=[40.1469622455993, -88.97747884394317], zoom_start=4.45, tiles='Stamen Terrain')
# This function returns a color based on the fraction of the loan still undisbursed
def rank_amount(amount):
    if amount == 0:
        return 'lightgreen'
    elif 0 < amount <= 0.25:
        return 'green'
    elif 0.25 < amount <= 0.5:
        return 'yellow'
    elif 0.5 < amount <= 0.75:
        return 'orange'
    elif amount > 0.75:
        return 'red'
    else:
        return 'black'
for coords, amount in zip(coords_to_test['coords'], coords_to_test['PctUndisbursed']):
    folium.CircleMarker(
        location=coords,
        radius=2,
        fill=True,
        color=rank_amount(amount)).add_to(loans_map)
loans_map
While the coordinates aren't perfect, by zooming into Illinois we can get a clearer picture of where the borrowers who requested loans live. Looking at the map, the borrowers who still have their entire loan undisbursed seem to live in more urban areas (near Chicago, Springfield, Champaign, and St. Louis), while loans that are more fully disbursed are in rural areas. We'll keep this in mind later on, when we run some prediction algorithms.
Now that we've completed our Exploratory Data Analysis, we can dive into our hypothesis testing and machine learning algorithms. The two main attributes we'd like to look at for our testing are OriginatingLender and NAICSCode, since both of these attributes showed some significant outliers when visualizing the data. We'll start off by running logistic regression on our outlier lenders and outlier NAICS codes to check whether there is a correlation between these values and whether a borrower still has an undisbursed balance.
First, let's see whether there really is a relationship between the originating lender a business chooses to use and whether it still has undisbursed funds. Since these are categorical variables, we will conduct a logistic regression analysis using the 'OriginatingLender' and 'RemainingUndisbursed' columns. Because logistic regression requires numeric inputs, we use one-hot encoding to flag whether the originating lender is any of our previously determined outlier lenders.
The results of this analysis will help us see if there is truly a relationship between these values.
analysis = vals.copy()
analysis['RemainingUndisbursed'] = 0
analysis.loc[vals['UndisbursedAmount'] != 0, 'RemainingUndisbursed'] = 1
analysis['isCPFBank'] = 0
analysis['isBSDBank'] = 0
analysis['isHSBBank'] = 0
analysis['isPCDFIBank'] = 0
analysis['isBCBank'] = 0
analysis['isCBBank'] = 0
analysis['isLBNABank'] = 0
analysis.loc[analysis['OriginatingLender'] == 'Capital Plus Financial, LLC', 'isCPFBank'] = 1
analysis.loc[analysis['OriginatingLender'] == 'BSD Capital, LLC dba Lendistry', 'isBSDBank'] = 1
analysis.loc[analysis['OriginatingLender'] == 'Harvest Small Business Finance, LLC', 'isHSBBank'] = 1
analysis.loc[analysis['OriginatingLender'] == 'Prestamos CDFI, LLC', 'isPCDFIBank'] = 1
analysis.loc[analysis['OriginatingLender'] == 'Benworth Capital', 'isBCBank'] = 1
analysis.loc[analysis['OriginatingLender'] == 'Customers Bank', 'isCBBank'] = 1
analysis.loc[analysis['OriginatingLender'] == 'Leader Bank, National Association', 'isLBNABank'] = 1
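The manual indicator columns above could also be generated in a single call with pandas' `get_dummies`, then restricted to the lenders of interest. A sketch on toy data (the rows here are hypothetical, for illustration only):

```python
import pandas as pd

toy = pd.DataFrame({"OriginatingLender": [
    "Customers Bank", "Benworth Capital", "Some Other Bank", "Customers Bank"]})

# One 0/1 column per distinct lender value
dummies = pd.get_dummies(toy["OriginatingLender"], prefix="is", dtype=int)
print(dummies["is_Customers Bank"].tolist())  # → [1, 0, 0, 1]
```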
Xtrain = analysis[['isCPFBank', 'isBSDBank', 'isHSBBank', 'isPCDFIBank', 'isBCBank', 'isCBBank', 'isLBNABank']]
ytrain = analysis[['RemainingUndisbursed']]
# building the model and fitting the data
log_reg = sm.Logit(ytrain, Xtrain).fit()
print(log_reg.summary())
Reading the summary table, we can see support for our initial hypothesis: there is a relationship between the originating lender and whether the loan has been disbursed. We can interpret the results by looking at the P>|z| column of the statsmodels summary. The p-values of 0.000 indicate that it is extremely unlikely we would observe an association this strong by chance alone, meaning there is a statistically significant relationship between whether the loan was disbursed and whether the originating lender is one of the top lenders found earlier.
Originating lender wasn't the only variable we found. Let's do another analysis based on the top NAICS Codes to see if our earlier hypothesis is true.
analysis['isHairstylist'] = 0
analysis['isBarber'] = 0
analysis['isDriver'] = 0
analysis['isCaterer'] = 0
analysis['isRemodeler'] = 0
analysis['isRetail'] = 0
analysis['isPersonalCare'] = 0
analysis['isJanitor'] = 0
analysis['isLandscaper'] = 0
analysis['isBSS'] = 0
analysis['isDaycare'] = 0
analysis['isBA'] = 0
analysis['isSF'] = 0
analysis['isArtist'] = 0
analysis['isFSC'] = 0
analysis['isTrucker'] = 0
analysis['isHomeHealth'] = 0
analysis['isCourier'] = 0
analysis.loc[analysis['NAICSCode'] == 812112, 'isHairstylist'] = 1
analysis.loc[analysis['NAICSCode'] == 812111, 'isBarber'] = 1
analysis.loc[analysis['NAICSCode'] == 485310, 'isDriver'] = 1
analysis.loc[analysis['NAICSCode'] == 722320, 'isCaterer'] = 1
analysis.loc[analysis['NAICSCode'] == 236118, 'isRemodeler'] = 1
analysis.loc[analysis['NAICSCode'] == 453998, 'isRetail'] = 1
analysis.loc[analysis['NAICSCode'] == 812990, 'isPersonalCare'] = 1
analysis.loc[analysis['NAICSCode'] == 561720, 'isJanitor'] = 1
analysis.loc[analysis['NAICSCode'] == 561730, 'isLandscaper'] = 1
analysis.loc[analysis['NAICSCode'] == 561499, 'isBSS'] = 1
analysis.loc[analysis['NAICSCode'] == 624410, 'isDaycare'] = 1
analysis.loc[analysis['NAICSCode'] == 813910, 'isBA'] = 1
analysis.loc[analysis['NAICSCode'] == 522220, 'isSF'] = 1
analysis.loc[analysis['NAICSCode'] == 711510, 'isArtist'] = 1
analysis.loc[analysis['NAICSCode'] == 722310, 'isFSC'] = 1
analysis.loc[analysis['NAICSCode'] == 484110, 'isTrucker'] = 1
analysis.loc[analysis['NAICSCode'] == 621610, 'isHomeHealth'] = 1
analysis.loc[analysis['NAICSCode'] == 492110, 'isCourier'] = 1
Xtrain2 = analysis[['isHairstylist','isBarber','isDriver', 'isCaterer', 'isRemodeler', 'isRetail', 'isPersonalCare',
'isJanitor', 'isLandscaper', 'isBSS', 'isDaycare', 'isBA', 'isSF', 'isArtist', 'isFSC', 'isTrucker',
'isHomeHealth', 'isCourier']]
ytrain2 = analysis[['RemainingUndisbursed']]
# building the model and fitting the data
log_reg2 = sm.Logit(ytrain2, Xtrain2).fit()
print(log_reg2.summary())
Again we can see there is a relationship between different jobs and whether the loan remains undisbursed.
Now we're going to run some predictive analyses. Since we are trying to predict an outcome based on categorical data, we will run the RandomForest and DecisionTree classifiers. To learn more about either classifier, click the link attached to each algorithm. Both of these classifiers work by making decisions based on different features of the dataset. While most of our data is already encoded, we'll add in 'isRural', 'isHubzone', and 'isLMI' as additional features.
To begin, we will run a randomized search over several options for each classifier. The output of these searches will help us determine the best hyperparameters for our analysis. We'll then run each classifier with those hyperparameters and run a 10-fold cross-validation on each output to determine the accuracy of each classifier.
analysis["isRural"] = False
analysis.loc[analysis['RuralUrbanIndicator'] == 'R', 'isRural'] = True
analysis["isHubzone"] = False
analysis.loc[analysis['HubzoneIndicator'] == 'Y', 'isHubzone'] = True
analysis["isLMI"] = False
analysis.loc[analysis['LMIIndicator'] == 'Y', 'isLMI'] = True
X = analysis[['InitialApprovalAmount', 'isRural', 'isHubzone', 'isLMI', 'isCPFBank', 'isBSDBank',
'isHSBBank', 'isPCDFIBank', 'isBCBank', 'isCBBank', 'isLBNABank', 'isHairstylist', 'isBarber', 'isDriver', 'isCaterer',
'isRemodeler', 'isRetail', 'isPersonalCare', 'isJanitor', 'isLandscaper', 'isBSS', 'isDaycare', 'isBA', 'isSF',
'isArtist', 'isFSC', 'isTrucker', 'isHomeHealth', 'isCourier']]
y = analysis['RemainingUndisbursed']
# Using an 80-20 split on the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestClassifier()
random_grid = {'n_estimators': [10, 20, 30, 40, 50],
'max_features': ['auto', 'sqrt'],
'max_depth': [1, 2, 4, 8, 16],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 4, 9],
'bootstrap': [True, False]}
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 10, verbose=2, random_state=42, n_jobs = -1)
rf_random.fit(X_train, y_train)
# Feed these parameters into the algorithm
rf_random.best_params_
clf_p = RandomForestClassifier(n_estimators = 20, min_samples_split = 5, min_samples_leaf = 4, max_features = 'auto', max_depth = 16, bootstrap = True)
clf_p.fit(X_train, y_train)
clfp_pred = clf_p.predict(X_test)
clf_score = cross_val_score(clf_p, X, y, cv = 10)
cf = DecisionTreeClassifier()
random_grid = {'criterion': ['gini', 'entropy'],
'splitter': ['best', 'random'],
'max_depth': [1, 2, 4, 8, 16],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 4, 9]}
cf_random = RandomizedSearchCV(estimator = cf, param_distributions = random_grid, n_iter = 100, cv = 10, verbose=2, random_state=42, n_jobs = -1)
cf_random.fit(X_train, y_train)
# Feed these parameters into the algorithm
cf_random.best_params_
clf_d = DecisionTreeClassifier(random_state = 42, splitter = 'best', min_samples_split = 2, min_samples_leaf = 9, max_depth = 8, criterion = 'entropy')
clf_d.fit(X_train, y_train)
clfd_pred = clf_d.predict(X_test)
d_clf_score = cross_val_score(clf_d, X, y, cv = 10)
Now let's compare how well each algorithm did:
print(stats.mean(clf_score))
print(stats.mean(d_clf_score))
We can see from here that our DecisionTree classifier was about 81% accurate!
Now that we've walked through the entire data science pipeline, let's talk about some of our findings. We learned that there is a correlation between the originating lender a borrower selects for their loan and whether or not their loan is disbursed. We also learned that certain jobs are less likely to have their loan fully disbursed. Using these values as features for our predictors, we were able to correctly predict whether a given borrower would have an outstanding loan disbursement. Borrowers in Illinois should beware when applying for future loans and take note of who they are using as their originating lender.
There are still several opportunities for extended analysis here. First of all, this analysis was confined to Illinois. It would be interesting to see if any of the trends discovered here extend to other states, and if not, what the differences are. There is also an opportunity to bring in more demographic data and see if that has any other effects on loan disbursement. Historically, there has been some conflict regarding race and loans, so mapping out that process to this data may provide new insights as well.
We hope you enjoyed learning about data science and PPP Loan data through this tutorial!